Tagging Urdu Text with Parts of Speech: A Tagger Comparison
Abstract
In this paper, four state-of-the-art probabilistic taggers (TnT tagger, TreeTagger, RF tagger and SVM tool) are applied to the Urdu language. A syntactic tagset is proposed for the experiment, and the models are trained on a corpus of 100,000 tokens. Using the lexicon extracted from the training corpus, SVM tool shows the best accuracy, 94.15%. When a separate lexicon of 70,568 types is provided, SVM tool again shows the best accuracy, 95.66%.
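As a rough illustration of two ideas the abstract relies on, the sketch below extracts a lexicon (word form to observed tags) from a tagged training corpus and computes token-level accuracy. It is a minimal sketch, not the authors' tooling: the function names `extract_lexicon` and `token_accuracy`, the placeholder tokens, and the tags `NN`, `VB` and `ADJ` are all assumptions made for illustration and are not taken from the paper's tagset or data.

```python
from collections import defaultdict


def extract_lexicon(tagged_sentences):
    """Map each word form to the set of tags it carries in the training data.

    `tagged_sentences` is a list of sentences, each a list of (word, tag) pairs.
    """
    lexicon = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexicon[word].add(tag)
    return dict(lexicon)


def token_accuracy(gold, predicted):
    """Return the percentage of tokens whose predicted tag matches the gold tag."""
    gold_tags = [tag for sentence in gold for _, tag in sentence]
    pred_tags = [tag for sentence in predicted for _, tag in sentence]
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted must contain the same number of tokens")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return 100.0 * correct / len(gold_tags)


# Hypothetical toy data: placeholder tokens and tags, not the paper's corpus.
train = [[("a", "NN"), ("b", "VB")], [("a", "VB")]]
gold = [[("a", "NN"), ("b", "VB"), ("c", "NN"), ("d", "ADJ")]]
pred = [[("a", "NN"), ("b", "VB"), ("c", "VB"), ("d", "ADJ")]]

lexicon = extract_lexicon(train)
accuracy = token_accuracy(gold, pred)
print(sorted(lexicon["a"]), accuracy)
```

In this toy setup, a separately supplied lexicon would simply be a larger dictionary of the same shape, merged with the one extracted from the training data.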
Similar resources
Automated part-of-speech analysis of Urdu: conceptual and technical issues
Part-of-speech (POS) tagging is the process of labelling tokens in a text with tags that indicate their morphosyntactic category, and has a wide range of applications in computational and corpus linguistics, such as the production of corpus-based dictionaries and grammars. This paper describes an experiment in extending POS tagging to a hitherto untagged language, Urdu. The most challenging tas...
Morphological Ending-based Strategies of Unknown Word Estimation for Statistical POS Urdu Tagger
Natural language processing has made wide use of statistical language models to solve disambiguation problems. Over the past decades, different POS tagging techniques have been proposed for English, European and East Asian languages. In this paper, our focus is POS tagging for Urdu, since Urdu tagging systems are still in their infancy. We have combined two approaches (Statisti...
Tagger Voting for Urdu
In this paper, we focus on improving part-of-speech (POS) tagging for Urdu by using existing tools and data for the language. In our experiments, we use Humayoun’s morphological analyzer, the POS tagging module of an Urdu Shallow Parser and our own SVM Tool tagger trained on CRULP manually annotated data. We convert the output of the taggers to a common format and more importantly unify their t...
Part of Speech Tagging - A solved problem?
Since 100 B.C., humans have been aware that language consists of several distinct parts, called parts of speech. Identifying those parts of speech plays a crucial role in many fields of linguistics. Since TAGGIT, the first large-scale part-of-speech tagger, many algorithms and methods have been developed. These include rule-based, probabilistic and hybrid taggers. When tagging large text corpora so...
A Part-of-Speech Tagging System for the Persian Language
Abstract: Part-of-speech (POS) tagging is an essential step for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...